Street-level clustering analysis of public space in London borough of Camden.

This tutorial provides a detailed description of the clustering analysis of streets of Camden. It guides through the process of data clearing and processing, clustering algorithm, cluster exploration and correlation with crime levels on the streets. The aim of the analysis is to determine which streets are woman-friendly and have a potential to make women feel safe while walking, and therefore the data used includes the width of the pavements, street lights, public toilets and urban greenery.

1. Data

In the analysis, the following data was used:

First, let’s load some libraries we’ll be using

library(sf)
library(here)
library(tmap)
library(tidyverse)
library(leaflet)
library(leafgl)
library(mapdeck)
library(RColorBrewer)
library(dplyr)
library(stringr)
library(ggplot2)

1.1 Data loading and initial checks

Let’s read in London Boroughs data, and make a Camden Outline layer, so we can clip other data to it.

# Reading in London boundaries data
LondonBoroughs <- st_read(here::here("data",
                            "statistical-gis-boundaries-london",
                            "ESRI",
                            "London_Borough_Excluding_MHW.shp"))

# here, we create a CamdenOutline to clip other data to it
CamdenOutline <- LondonBoroughs %>% 
  filter(., NAME=="Camden")

We want all our data to be in a same projection - EPSG: 27700, so let’s make sure CamdenOutline is. Plot it to see if everything looks alright.

# check the projection
print(CamdenOutline)

# reproject Camden
CamdenOutlineProjected <- CamdenOutline %>%
  st_transform(.,27700)
# let's see CamdenOutline on a map
tmap_mode("view")
qtm(CamdenOutline)

Everything looks alright so let’s proceed with the pavements width data. It was firstly processed in QGIS with the function dissolve, as in the original dataset streets are split into smaller bits and the analysis requires data to be on the street level. The file was then saved as geojson.

# Camden streets - pavement width
# reading in file
streetsGJSON <- st_read(here::here("data", "streets3.geojson"))

# check the projection
print(streetsGJSON)

# change the projection
streets <- streetsGJSON %>%
  st_transform(.,27700)
print(streets)

The pavements width data covers the whole London, so let’s clip it with CamdenOutline.

# get rid of the streets outside Camden
pavementsWidth <- streets[CamdenOutlineProjected,]

Let’s see how it looks like on a map.

# see the result
qtm(pavementsWidth)

We’re going to repeat the previous steps with street lights, trees and public toilets data.

# Camden street lights
streetLights <- st_read(here::here("data",
                             "Camden Street Lighting",
                             "geo_export_f77a2cc3-40d7-403c-84fe-64be970169bf.shp"))


# Camden public toilets
publicToiletsCSV <- read_csv(here::here("data",
                               "Public_Conveniences_In_Camden_Map.csv"))

publicToilets <- st_as_sf(publicToiletsCSV,
                          coords = c("Longitude","Latitude"),
                          crs = 4326)

# trees in Camden
treesCamdenCSV <- read_csv(here::here("data",
                             "Trees_In_Camden.csv"))

1.2 Data clearing and preparation

The aim of this part is to a prepare one dataset for clustering. To do so, firstly, we need to clear and prepare our data for joining.

1.2.1 Pavements width data

Let’s check for any missing values.

# Pavements Width Data

# check if there are any nulls/nas
pavementsWidth %>% 
  summarize_all(funs(sum(is.na(.))))
## Simple feature collection with 1 feature and 8 fields
## geometry type:  MULTILINESTRING
## dimension:      XY
## bbox:           xmin: 523973.4 ymin: 180966 xmax: 531554.6 ymax: 187481.4
## projected CRS:  OSGB 1936 / British National Grid
##   fid id DISTNAME ROADNUMBER CLASSIFICA foW caW toW
## 1   0  0        1        961          0   0   0   0
##                         geometry
## 1 MULTILINESTRING ((525845 18...

There are missing values in the road number column, but we’re not going to use that information so there’s no need deleting them. Here, we’re also creating a completeData datset, where we will store all our important variables. We’re creating it by taking pavementsWidth dataset and adding a new column - propFowTow, which is a proportion of footpath width to the total street width. Furthermore, we’re calculating the length of each street and store it in our dataset.

# add a new column to see a proportion of pavement the total street width
# and create a new dataframe to store all the data needed for the analysis
completeData <- transform(pavementsWidth, propFowTow = foW/toW)

# calculate the length of a street and add it to completeData dataset
completeData$streetLength <- st_length(completeData$geometry)

# check whether a new column is numeric
is.numeric(completeData$streetLength)

# transform it to numeric
completeData$streetLength <- as.numeric(completeData$streetLength)

Let’s round the numbers, check for any missing values, and plot one of the variables to see if everything looks fine.

# rounding the columns
completeData <- completeData %>% 
  mutate_if(is.numeric, ~round(., 2))

# check for nans
sum(is.na(completeData$streetLength))
# see how it looks like
tm_shape(completeData)+
  tm_lines("propFowTow", 
           palette = "RdYlGn",
           direction=-1)

1.2.2 Public Toilets Data

We’re going to add public toilets to our complete dataset by the name of the street they are located at. First, let’s check if all the public toilets have that information.

### Public Toilets Data

# check if there are null values in street name column
sum(is.na(publicToilets$Street))

# remove values where the street name is not known
publicToilets <- publicToilets %>% 
  drop_na(Street)

Once it’s done, we can join it to our complete dataset. We only want to add columns Name and Street. Based on that, we’re creating a new 0-1 column to indicate whether there is a public toilet located by the street or not.

# add public Toilets to complete dataset
# select only columns I want to add
toiletsStreet <-  publicToilets %>% 
  select(Name,Street)

# join
completeData <- left_join(completeData,toiletsStreet, by= c("DISTNAME"="Street"))

# add a new column 0-1 based on whether there is a toilet by the street or not
completeData$publicToilet <- ifelse(is.na(completeData$Name),0, 1)

# add a new column with public toilets per 10 m of the street
completeData$toilets10m <- (completeData$publicToilet*10)/completeData$streetLength
# check if worked
head(completeData)
## Simple feature collection with 6 features and 13 fields
## geometry type:  MULTILINESTRING
## dimension:      XY
## bbox:           xmin: 525658 ymin: 181659 xmax: 530661 ymax: 186393
## projected CRS:  OSGB 1936 / British National Grid
##     fid    id             DISTNAME ROADNUMBER CLASSIFICA   foW   caW   toW
## 1 73888 72914    Templewood Avenue       <NA> Local Road  3.25 12.20 15.45
## 2 67276 66264      The Old Orchard       <NA> Local Road  0.00  4.89  4.89
## 3 62098 61046 Little Albany Street       <NA> Local Road  0.93  5.23  6.16
## 4 74037 73063          Branch Hill       <NA> Minor Road 10.87 11.61 22.48
## 5 91125 90367      Red Lion Square       <NA> Local Road  1.28 13.86 15.14
## 6 71402 70418           Ornan Road       <NA> Local Road  4.76  7.86 12.62
##   propFowTow streetLength Name                       geometry publicToilet
## 1       0.21       457.52 <NA> MULTILINESTRING ((525845 18...            0
## 2       0.00        71.86 <NA> MULTILINESTRING ((527514 18...            0
## 3       0.15        38.77 <NA> MULTILINESTRING ((528871 18...            0
## 4       0.48       267.53 <NA> MULTILINESTRING ((526017.2 ...            0
## 5       0.08       245.11 <NA> MULTILINESTRING ((530571 18...            0
## 6       0.38       252.50 <NA> MULTILINESTRING ((527184 18...            0
##   toilets10m
## 1          0
## 2          0
## 3          0
## 4          0
## 5          0
## 6          0

1.2.3 Street Lights Data

Firstly, let’s check the projection, see how the data looks like and plot the street lights.

# check the projection
print(streetLights)

# see how the data looks like
head(streetLights)

# change the crs of streetLights
streetLights <- streetLights %>%
  st_transform(.,27700)
print(streetLights)
qtm(streetLights)

The way the data is structured makes the join a little problematic, so first, we need to calculate how many street lamps there are on each street.

# extract the unique street names
streetNames <-  c(unique(streetLights$street_nam))

# check if there are streets with the same names but different ids in completeData
library(plyr)
count(completeData,c('id','DISTNAME'))

# how many times the street names appear in streetLights dataset
count(streetLights, c('street_nam'))

# convert it to a df
numberOfStreetLightsDF <- data.frame(count(streetLights, c('street_nam')))

# to join two dataframes, we need to make sure street names in both datasets look similar
numberOfStreetLightsDF$street_nam=tolower(numberOfStreetLightsDF$street_nam)
completeData$DISTNAME=tolower(completeData$DISTNAME)

# join
completeData <- completeData %>% 
  left_join(numberOfStreetLightsDF, by = c("DISTNAME" = "street_nam"))

Let’s see if it worked

head(completeData)
## Simple feature collection with 6 features and 14 fields
## geometry type:  MULTILINESTRING
## dimension:      XY
## bbox:           xmin: 525658 ymin: 181659 xmax: 530661 ymax: 186393
## projected CRS:  OSGB 1936 / British National Grid
##     fid    id             DISTNAME ROADNUMBER CLASSIFICA   foW   caW   toW
## 1 73888 72914    templewood avenue       <NA> Local Road  3.25 12.20 15.45
## 2 67276 66264      the old orchard       <NA> Local Road  0.00  4.89  4.89
## 3 62098 61046 little albany street       <NA> Local Road  0.93  5.23  6.16
## 4 74037 73063          branch hill       <NA> Minor Road 10.87 11.61 22.48
## 5 91125 90367      red lion square       <NA> Local Road  1.28 13.86 15.14
## 6 71402 70418           ornan road       <NA> Local Road  4.76  7.86 12.62
##   propFowTow streetLength Name publicToilet toilets10m freq
## 1       0.21       457.52 <NA>            0          0   16
## 2       0.00        71.86 <NA>            0          0   NA
## 3       0.15        38.77 <NA>            0          0   NA
## 4       0.48       267.53 <NA>            0          0   10
## 5       0.08       245.11 <NA>            0          0   12
## 6       0.38       252.50 <NA>            0          0    8
##                         geometry
## 1 MULTILINESTRING ((525845 18...
## 2 MULTILINESTRING ((527514 18...
## 3 MULTILINESTRING ((528871 18...
## 4 MULTILINESTRING ((526017.2 ...
## 5 MULTILINESTRING ((530571 18...
## 6 MULTILINESTRING ((527184 18...

Now we’ve got the numbers of street lamps added, let’s check for null values in that column and change them to 0.

# check for nas in number of streetLights
sum(is.na(completeData$freq))

# there is a lot of nas in number of street lights, so let's replace
# them with 0
completeData$freq <- completeData$freq %>% 
  replace_na(0)

# check if worked
sum(is.na(completeData$freq))

Now, we need to calculate how many street lamps there are per 10 m of each street. We’re going to do the same thing later on with other variables, so we have a consistent measure.

# calculate how many street lights are per 10 m of a street
# adding a field - how many street lights per 10 m?
completeData$streetLights10m <- (completeData$freq*10)/completeData$streetLength

1.2.4 Trees Data

So far, we’ve only read trees data without converting it to an sf object from a CSV. That’s because there are null values in the location column. To convert it, we need to get rid of them first. Then, we’re clipping it to our CamdenOutline shape and repeating the procedure of the street lights processing.

# check for nas
sum(is.na(treesCamdenCSV$Location))

# okay, theres a lot (72) so lets remove them
treesCamdenCSV <- treesCamdenCSV %>%
  drop_na(Location)

# convert to sf
treesCamden <- treesCamdenCSV %>%
  st_as_sf(., coords = c("Longitude", "Latitude"),
           crs = 4326) %>%
  st_transform(., 27700)

# let's remove the ones outside Camden
treesCamden <- treesCamden[CamdenOutlineProjected,]

# making a string of street names from completeData
completeStreetNames <-  c(unique(completeData$DISTNAME))

# lowering the names of sites in treesCamden
treesCamden$`Site Name`=tolower(treesCamden$`Site Name`)

# select trees only by the streets
streetTrees <- data.frame(treesCamden$`Site Name`,treesCamden$`Number Of Trees`)

# aggregating, so we know how many trees are there on the street
streetTrees <- aggregate(streetTrees$treesCamden..Number.Of.Trees., by=list(site=streetTrees$treesCamden..Site.Name.), FUN=sum)

# select only rows with street names with completeDataset street names
streetTrees2 <- streetTrees %>% 
  filter(str_detect(streetTrees$site, paste(completeStreetNames,collapse = '|')))

# join
completeData <- left_join(completeData,streetTrees2, by= c("DISTNAME"="site"))

# change nas in trees to zeroes
completeData$x <- ifelse(is.na(completeData$x),0, completeData$x)

# adding a column with a number of trees per 10m
completeData$trees10m <- (completeData$x*10)/completeData$streetLength

1.3 The whole dataset processing

Now we’ve got the whole dataset ready, let’s change the names of columns so we know what we’re looking at and get rid of the variables we’re not going to include in our analysis.

# renaming the column names 
completeData <- completeData %>% 
  dplyr::rename(
    streetName = DISTNAME,
    publicToiletName=Name,
    numberOfStLights=freq,
    numberOfTrees=x
  )

#get rid of the columns we dont need
completeDataNumbers <- subset(completeData, select=-c(id, ROADNUMBER, CLASSIFICA, publicToiletName, publicToilet))

# check if any nas
completeDataNumbers %>% 
  summarize_all(funs(sum(is.na(.))))

# there are nas in street names, but I'm not going to remove 
# them, as they still have geometry

Here we’re subsetting variables needed for clustering and saving it into dataFinal dataset

# and subset data needed for clustering
dataFinal <- subset(completeDataNumbers, select=-c(foW, caW, toW, streetLength,numberOfStLights, numberOfTrees))

# check the data types
str(dataFinal)

As we’ve got the data ready we can move on to checking the distributions of the variables. Firstly though, let’s check for infinite values and get rid of them.

# rounding the columns
dataFinal <- dataFinal %>% 
  mutate_if(is.numeric, ~round(., 2))

# check for infinite values 
sum(!is.finite(dataFinal$streetLights10m))
sum(!is.finite(dataFinal$propFowTow))
sum(!is.finite(dataFinal$trees10m))
sum(!is.finite(dataFinal$toilets10m))

# there is one infinite value here let's remove it
# changing inf to na
dataFinal <- dataFinal %>%
  mutate_if(is.numeric, list(~na_if(., Inf)))

#dropping na
dataFinal<- dataFinal %>%
  drop_na('streetLights10m')

# check if worked
sum(!is.finite(dataFinal$streetLights10m))

# histograms with distribution
propFowTowHist <- ggplot(data=dataFinal, aes(`propFowTow`)) + 
  geom_histogram(
    color="#E69F00",
    fill="white")

streetLights10mHist <- ggplot(data=dataFinal, aes(`streetLights10m`)) + 
  geom_histogram(
    color="#E69F00",
    fill="white")

treesHist <- ggplot(data=dataFinal, aes(`trees10m`)) + 
  geom_histogram(
    color="#E69F00",
    fill="white")

toiletsHist <- ggplot(data=dataFinal, aes(`toilets10m`)) + 
  geom_histogram(
    color="#E69F00",
    fill="white")

Let’s plot the histograms next to each other.

# to see the histogram next to each other
library(gridExtra)
grid.arrange(propFowTowHist, streetLights10mHist, treesHist, toiletsHist, ncol=2)

Another way of visualising the distributions

# cool distribution plots
library(cluster.datasets)
plot1 <- dataFinal %>% 
  ggplot(aes(x = "streets", y = propFowTow)) + 
  geom_jitter(width = .025, height = 0, size = 2, alpha = .5, color = "#2b85be") +
  labs(x = "", y="proportion of footpath to total street width")
plot1

plot2 <-  dataFinal %>% 
  ggplot(aes(x = "streets", y = streetLights10m)) + 
  geom_jitter(width = .02, height = 0, size = 2, alpha = .6,  color = "#00A08F") +
  labs(x = "", y="street lights per 10m of a street")

plot3 <- dataFinal %>% 
  ggplot(aes(x = 'streets', y = trees10m)) + 
  geom_jitter(width = .02, height = 0, size = 2, alpha = .6,  color = "#EBCC72") +
  labs(x = "", y="trees per 10m of a street")

plot4 <-  dataFinal %>%
  ggplot(aes(x = 'streets', y = toilets10m)) + 
  geom_jitter(width = .02, height = 0, size = 2, alpha = .6,  color = "#F99227") +
  labs(x = "", y="public toilets per 10m of a street")
grid.arrange(plot1, plot2, plot3, plot4)

1.4 Data transformation and standardization

1.4.1 Transformation

Only one of the variables (propFowTow) seems to have a normal distribution so let’s transform other variables with log transformation and see if it reduces the skewness.

# log transformation, getting rid of the skewness on street lights data
streetLights10mLogHist <- ggplot(dataFinal, aes(x=log(streetLights10m))) + 
  geom_histogram(color="#E69F00",
                 fill="white") +
  geom_density()

# and trees
treesLogHist <- ggplot(dataFinal, aes(x=log(trees10m))) + 
  geom_histogram(color="#E69F00",
                 fill="white") +
  geom_density()

# propFowTow
propLogHist <- ggplot(dataFinal, aes(x=log(propFowTow))) + 
  geom_histogram(color="#E69F00",
                 fill="white") +
  geom_density()

# toilets
toiletsLogHist <- ggplot(dataFinal, aes(x=log(toilets10m))) + 
  geom_histogram(color="#E69F00",
                 fill="white") +
  geom_density()
grid.arrange(streetLights10mLogHist, treesLogHist, propLogHist, toiletsLogHist)

Let’s compare mean values of the variables before and after the log transformation.

# compare mean values before and after log
logData <- data.frame(log(dataFinal$propFowTow),
                      log(dataFinal$toilets10m),
                      log(dataFinal$streetLights10m),
                      log(dataFinal$trees10m))

summary(logData)
##  log.dataFinal.propFowTow. log.dataFinal.toilets10m.
##  Min.   :   -Inf           Min.   : -Inf            
##  1st Qu.:-1.1087           1st Qu.: -Inf            
##  Median :-0.9163           Median : -Inf            
##  Mean   :   -Inf           Mean   : -Inf            
##  3rd Qu.:-0.7765           3rd Qu.: -Inf            
##  Max.   : 0.0000           Max.   :-2.04            
##  log.dataFinal.streetLights10m. log.dataFinal.trees10m.
##  Min.   :   -Inf                Min.   :   -Inf        
##  1st Qu.:-1.2379                1st Qu.:   -Inf        
##  Median :-0.9676                Median :-1.9661        
##  Mean   :   -Inf                Mean   :   -Inf        
##  3rd Qu.:-0.6931                3rd Qu.:-0.5978        
##  Max.   : 1.1184                Max.   : 2.2875
summary(dataFinal)
##       fid          streetName          propFowTow       toilets10m       
##  Min.   :   605   Length:1090        Min.   :0.0000   Min.   :0.0000000  
##  1st Qu.: 62040   Class :character   1st Qu.:0.3300   1st Qu.:0.0000000  
##  Median : 66579   Mode  :character   Median :0.4000   Median :0.0000000  
##  Mean   : 68107                      Mean   :0.3904   Mean   :0.0008257  
##  3rd Qu.: 74054                      3rd Qu.:0.4600   3rd Qu.:0.0000000  
##  Max.   :273078                      Max.   :1.0000   Max.   :0.1300000  
##  streetLights10m     trees10m                 geometry   
##  Min.   :0.0000   Min.   :0.0000   MULTILINESTRING:1090  
##  1st Qu.:0.2900   1st Qu.:0.0000   epsg:27700     :   0  
##  Median :0.3800   Median :0.1400   +proj=tmer...  :   0  
##  Mean   :0.3863   Mean   :0.3263                         
##  3rd Qu.:0.5000   3rd Qu.:0.5500                         
##  Max.   :3.0600   Max.   :9.8500
variance <- data.frame(var(dataFinal$propFowTow),
                       var(dataFinal$toilets10m),
                       var(dataFinal$streetLights10m),
                       var(dataFinal$trees10m),
                       var(logData$log.dataFinal.propFowTow.),
                       var(logData$log.dataFinal.toilets10m.),
                       var(logData$log.dataFinal.streetLights10m.),
                       var(logData$log.dataFinal.trees10m.))

The log transformation makes variables more normally distributed, however it also results in many infinite values. K-means clustering requires the data to have similar mean values and variances therefore we will proceed with the previous version of our dataset.

1.4.1 Normalization

Here, we’re going to normalize our data so it exist within the same range. Let’s take only numeric values from our dataset.

clusteringDataset2 <- data.frame(dataFinal$propFowTow,
                                 dataFinal$streetLights10m,
                                 dataFinal$trees10m,
                                 dataFinal$toilets10m)
head(clusteringDataset2)
# perform normalisation here
clusteringDatasetNormalised <- data.frame(scale(clusteringDataset2, center = TRUE, scale = TRUE))

summary(clusteringDatasetNormalised)
##  dataFinal.propFowTow dataFinal.streetLights10m dataFinal.trees10m
##  Min.   :-2.76794     Min.   :-1.44180          Min.   :-0.6453   
##  1st Qu.:-0.42826     1st Qu.:-0.35946          1st Qu.:-0.6453   
##  Median : 0.06804     Median :-0.02356          Median :-0.3685   
##  Mean   : 0.00000     Mean   : 0.00000          Mean   : 0.0000   
##  3rd Qu.: 0.49343     3rd Qu.: 0.42431          3rd Qu.: 0.4423   
##  Max.   : 4.32200     Max.   : 9.97878          Max.   :18.8333   
##  dataFinal.toilets10m
##  Min.   :-0.1313     
##  1st Qu.:-0.1313     
##  Median :-0.1313     
##  Mean   : 0.0000     
##  3rd Qu.:-0.1313     
##  Max.   :20.5432

Let’s check if the variables are correlated.

# correllation matrix - checking if not correlated
# need to work on the graphics here
library(corrplot)
source("http://www.sthda.com/upload/rquery_cormat.r")


col<- colorRampPalette(c("#2b85be","#EBCC72","#F99227"))(20)
rquery.cormat(clusteringDatasetNormalised, col=col, type='full')

The variables do not seem correlated, so let’s proceed with the analysis

clusteringDatasetFinal <- data.frame(dataFinal$streetName,clusteringDatasetNormalised)

# renaming the streetName column 
clusteringDatasetFinal <- clusteringDatasetFinal %>% 
  dplyr::rename(
    streetName = dataFinal.streetName)

head(clusteringDatasetFinal)

2. The analysis

Now we’re going to proceed with the K-Means clustering. Firstly let’s load the packages

library(cluster)    
library(factoextra) 
library(NbClust) 

2.1 Choosing the right number of clusters

This method of clustering requires the number of clusters as an imput. To choose the right number of clusters we’re going to use the Elbow method, Silhuette method and NBClust() function.

# choosing the right value of k
# Elbow method
fviz_nbclust(clusteringDatasetNormalised, kmeans, method = "wss",linecolor = "#ed7d31") +
  geom_vline(xintercept = 5, linetype = 2) +
  labs(subtitle = "Elbow method") 

the number of clusters suggested here is 5, let’s try silhuette

# Silhouette method
fviz_nbclust(clusteringDatasetNormalised, kmeans, method = "silhouette",linecolor = "#ed7d31") +
  labs(subtitle = "Silhouette method")

according to the silhuette method, the number of clusters should be 9, let’s try the last estimation

nbclust_out <- NbClust(
  data = clusteringDatasetNormalised,
  distance = "euclidean",
  min.nc = 2, 
  max.nc = 10, 
  method = "kmeans")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
## 

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 4 proposed 2 as the best number of clusters 
## * 5 proposed 3 as the best number of clusters 
## * 7 proposed 5 as the best number of clusters 
## * 5 proposed 8 as the best number of clusters 
## * 2 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  5 
##  
##  
## *******************************************************************
# a dataframe of the optimal number of clusters
nbclust_plot <- data.frame(clusters = nbclust_out$Best.nc[1, ])
# select only indices which select between 2 and 10 clusters
nbclust_plot <- subset(nbclust_plot, clusters >= 2 & clusters <= 10)

# create plot
ggplot(nbclust_plot) +
  aes(x = clusters) +
  geom_histogram(fill = "#ed7d31") +
  labs(x = "Number of clusters", y = "Frequency among all indices", title = "Optimal number of clusters") 

The last one suggests 5 as well so let’s proceed with 5 clusters.

2.2 K-means clustering

As we know our number of clusters already, we can go for the clustering.

set.seed(123)
clustered <- kmeans(clusteringDatasetNormalised, 5, nstart = 200)

# add a cluster number to our dataset
dataFinal$cluster <- clustered$cluster

Let’s see the result on a map. I’m using a mapdeck package here which requires a personal token. That could be easily obtained from mapbox website after registering.

# need to change the projection to view data using mapdeck
dataFinal <- dataFinal %>%
  st_transform(.,4326)

# mapdeck style set to dark
ms <- mapdeck_style("dark")

# a map showing clusters
mapdeck(style=ms, token = token) %>%
  add_path(
    data = dataFinal,
    stroke_colour = "cluster",
    auto_highlight = TRUE,
    tooltip = "streetName",
    palette = colorRamp( c("#2b85be", "#00A08F", "#EBCC72", "#F99227", "#F66747"))( (1:256)/256 ),
    legend = TRUE,
    legend_options = (list(title = "Cluster"))
  ) %>%
  mapdeck_view(
    location = c(-0.1486, 51.5406),
    zoom = 11.5
  )

2.3 Exploring the clusters

library(GGally)
library(plotly)

Let’s plot the variables here and see how the numbers change for each cluster.

dataFinal$cluster <- as.factor(clustered$cluster)

p <- ggparcoord(data = dataFinal,
                columns = c(3:6), 
                mapping=aes(color=as.factor(cluster))) +
  scale_color_discrete("cluster") +
  labs(x = "variable", 
       y = "value", 
       title = "Clustering")

ggplotly(p)

Here, we’re calculating mean values of each variable in a cluster to define them

# a dataframe with calculated mean values from each cluster
meanClusterValues <- dataFinal %>% 
  group_by(cluster) %>% 
  summarise(
    propFowTow = mean(propFowTow),
    streetLights10m=mean(streetLights10m),
    trees10m=mean(trees10m),
    toilets10m=mean(toilets10m))

# a different one
plotAll <- ggparcoord(data = meanClusterValues,
                      columns = c(2:5), 
                      mapping=aes(color=as.factor(cluster))) +
  scale_color_discrete("cluster") +
  labs(x = "variable", 
       y = "value", 
       title = "Clustering")
plotAll

ggplotly(plotAll)

# let's define the barplots for each variable
plotMeanProp <- ggplot(meanClusterValues, aes(x = cluster, y = propFowTow, fill= cluster)) +
  geom_col(position = "dodge")

plotStreetLights <- ggplot(meanClusterValues, aes(x = cluster, y = streetLights10m, fill= cluster)) +
  geom_col(position = "dodge")

plotTrees <- ggplot(meanClusterValues, aes(x = cluster, y = trees10m, fill= cluster)) +
  geom_col(position = "dodge")

plotToilets <- ggplot(meanClusterValues, aes(x = cluster, y = toilets10m, fill= cluster)) +
  geom_col(position = "dodge")

and put it all together

library(gridExtra)
grid.arrange(p,plotAll,plotMeanProp,plotStreetLights,plotTrees,plotToilets)

Let’s rank our clusters according to levels of women-friendliness

# cluster rank
dataFinal <- dataFinal %>%
  mutate(clusterRank = case_when(cluster==1 ~ 2,
                                 cluster==2 ~ 1,
                                 cluster==3 ~ 5,
                                 cluster==4 ~ 3,
                                 cluster==5 ~ 4))

and plot our results with mapdeck

# a map showing ranked clusters
mapdeck(style=ms, token = token) %>%
  add_path(
    data = dataFinal,
    stroke_colour = "clusterRank",
    auto_highlight = TRUE,
    tooltip = "streetName",
    palette = colorRamp(c("#2b85be", "#00A08F", "#EBCC72", "#F99227", "#F66747"))( (1:256)/256 ),
    legend = TRUE,
    legend_options = (list(title = "Cluster"))
  ) %>%
  mapdeck_view(
    location = c(-0.1486, 51.5406),
    zoom = 11.5
  )

3. Comparing the results with crime data

Now we’ve identified clusters, we can see which cluster has the highest number of sex offenses. Let’s load the crime data first and filter sex offenses.

# read in crime data
allCrimeCSV <- read_csv(here::here("data", 
                                "On_Street_Crime_in_Camden.csv"))  
allCrime <- st_as_sf(allCrimeCSV,
                     coords = c("Longitude","Latitude"),
                     crs = 27700)

# filter violance and sexual offences
sexOff <- allCrime %>% 
  filter(str_detect(Category,"Violence and sexual offences"))

Create a new dataframe with sex offences. Crime on street data is structured in a similar way to street lights and trees data, so we need to aggregate it first. Then we can join it with cluster data.

sexOff <- data.frame(sexOff$Category,sexOff$`Street Name`)

# add a column with a number, so I can aggregate
sexOff$number <- 1

sexOffCount <- aggregate(sexOff$number, by=list(site=sexOff$sexOff..Street.Name.), FUN=sum) 

# lowering letters so I can smoothly join
sexOffCount$site <- tolower(sexOffCount$site)

# trying to do the join
library(fuzzyjoin)
library(stringr)

sexOffCountJoin <- sexOffCount %>% 
  fuzzy_inner_join(dataFinal, by = c("site" = "streetName"), match_fun = str_detect)

sexOffCountJoin <- subset(sexOffCountJoin, select=-c(site))

# renaming the crime column
sexOffCountJoin <- sexOffCountJoin %>% 
  dplyr::rename(
    numberOfOffences = x)

# group see how many crimes in each cluster
crimeInClusters <- sexOffCountJoin %>% 
  group_by(cluster) %>% 
  summarise(
    crime = sum(numberOfOffences))

Now we can see if there is a correlation between crime rates and our clusters and their variables.

sexOffCorr <- subset(sexOffCountJoin, select=-c(fid, streetName,geometry))

# see if there is a correlation between crime and clusters and other variables
library(ggcorrplot)

sexOffCorr$cluster <- as.numeric(sexOffCorr$cluster)

#correlation matrix
corr <- round(cor(sexOffCorr), 2)
p.mat <- cor_pmat(sexOffCorr)
head(p.mat[, 1:4])
ggcorrplot(corr,hc.order = TRUE, type = "lower",
           outline.col = "white", lab = TRUE,
           ggtheme = ggplot2::theme_gray,
           colors = c("#2b85be","#fff5c5","#F99227"))